Abstract

Knowledge base (KB) completion adds new facts to a KB by making inferences from existing facts, for example by inferring with high likelihood nationality(X,Y) from bornIn(X,Y). Most previous methods infer simple one-hop relational synonyms like this, or use as evidence a multi-hop relational path treated as an atomic feature, like bornIn(X,Z) → containedIn(Z,Y). This paper presents an approach that reasons about conjunctions of multi-hop relations non-atomically, composing the implications of a path using a recurrent neural network (RNN) that takes as inputs vector embeddings of the binary relations in the path. Not only does this allow us to generalize to paths unseen at training time, but also, with a single high-capacity RNN, to predict new relation types not seen when the compositional model was trained (zero-shot learning). We assemble a new dataset of over 52M relational triples, and show that our method improves over a traditional classifier by 11%, and a method leveraging pre-trained embeddings by 7%.

1 Introduction 

Constructing large knowledge bases (KBs) supports downstream reasoning about resolved entities and their relations, rather than the noisy textual evidence surrounding their natural language mentions. For this reason KBs have been of increasing interest in both industry and academia (Bollacker et al., 2008; Suchanek et al., 2007; Carlson et al., 2010). Such KBs typically contain many millions of facts, most of them (entity1, relation, entity2) "triples" (also known as binary relations) such as (Barack Obama, presidentOf, USA) and (Brad Pitt, marriedTo, Angelina Jolie).

However, even the largest KBs are woefully incomplete (Min et al., 2013), missing many important facts, and therefore damaging their usefulness in downstream tasks. Ironically, these missing facts can frequently be inferred from other facts already in the KB, thus representing a sort of inconsistency that can be repaired by the application of an automated process. The addition of new triples by leveraging existing triples is typically known as KB completion.

Early work on this problem focused on learning symbolic rules. For example, Schoenmackers et al. (2010) learn Horn clauses predictive of new binary relations by exhaustively exploring relational paths of increasing length, and selecting those surpassing an accuracy threshold. (A "path" is a sequence of triples in which the second entity of each triple matches the first entity of the next triple.) Lao et al. (2011) introduced the Path Ranking Algorithm (PRA), which greatly improves efficiency and robustness by replacing exhaustive search with random walks, and using unique paths as features in a per-target-relation binary classifier. A typical predictive feature learned by PRA is that CountryOfHeadquarters(X, Y) is implied by IsBasedIn(X, A) and StateLocatedIn(A, B) and CountryLocatedIn(B, Y). Given IsBasedIn(Microsoft, Seattle), StateLocatedIn(Seattle, Washington) and CountryLocatedIn(Washington, USA), we can infer the fact CountryOfHeadquarters(Microsoft, USA) using the predictive feature. In later work, Lao et al. (2012) greatly increase available raw material for paths by augmenting KB-schema relations with relations defined by the text connecting mentions of entities in a large corpus (also known as OpenIE relations (Banko et al., 2007)).
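The Microsoft example above can be sketched in a few lines of Python. This is only an illustration of applying one path-shaped rule to a toy KB; the helper name `follow_path` and the three-triple KB are assumptions made for the example, not part of PRA or of the released dataset.

```python
from collections import defaultdict


def follow_path(triples, start, relations):
    """Return the entities reachable from `start` by following `relations` in order."""
    # Index the KB by (head entity, relation) for quick edge lookup.
    index = defaultdict(set)
    for head, rel, tail in triples:
        index[(head, rel)].add(tail)
    frontier = {start}
    for rel in relations:
        # Advance every entity in the frontier along one edge labelled `rel`.
        frontier = {tail for entity in frontier for tail in index.get((entity, rel), ())}
    return frontier


# Toy KB holding only the three facts from the running example (hypothetical data).
kb = [
    ("Microsoft", "IsBasedIn", "Seattle"),
    ("Seattle", "StateLocatedIn", "Washington"),
    ("Washington", "CountryLocatedIn", "USA"),
]
rule_body = ["IsBasedIn", "StateLocatedIn", "CountryLocatedIn"]

# Every endpoint of the path yields a candidate CountryOfHeadquarters fact.
inferred = {
    ("Microsoft", "CountryOfHeadquarters", y)
    for y in follow_path(kb, "Microsoft", rule_body)
}
print(inferred)
```

A symbolic rule of this kind fires only when the exact relation sequence is present, which is the brittleness the embedding-based approach is meant to address.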

However, these symbolic methods can produce many millions of distinct paths, each of which is categorically distinct, treated by PRA as a distinct feature. (See Figure 1.) Even putting aside the OpenIE relations, this limits the applicability of these methods to modern KBs that have thousands of relation types, since the number of distinct paths increases rapidly with the number of relation types. If textually-defined OpenIE relations are included, the problem is far more severe.

Better generalization can be gained by operating on embedded vector representations of relations, in which vector similarity can be interpreted as semantic similarity. For example, Bordes et al. (2013) learn low-dimensional vector representations of entities and KB relations, such that vector differences between two entities should be close to the vectors associated with their relations. This approach can find relation synonyms, and thus perform a kind of one-to-one, non-path-based relation prediction for KB completion. Similarly, Nickel et al. (2011) and Socher et al. (2013a) perform KB completion by learning embeddings of relations, but based on matrices or tensors. Universal schema (Riedel et al., 2013) learns to perform relation prediction cast as matrix completion (likewise using vector embeddings), but predicts textually-defined OpenIE relations as well as KB relations, and embeds entity-pairs in addition to individual entities. Like all of the above, it also reasons about individual relations, not the evidence of a connected path of relations.

This paper proposes an approach combining the advantages of (a) reasoning about conjunctions of relations connected in a path, (b) generalization through vector embeddings, and (c) reasoning non-atomically and compositionally about the elements of the path, for further generalization.

Our method uses recurrent neural networks (RNNs) (Werbos, 1990) to compose the semantics of relations in an arbitrary-length path. At each path-step it consumes both the vector embedding of the next relation, and the vector representing the path-so-far, then outputs a composed vector (representing the extended path-so-far), which will be the input to the next step. After consuming a path, the RNN should output a vector in the semantic neighborhood of the relation between the first and last entity of the path. For example, after consuming the relation vectors along the path Melinda Gates → Bill Gates → Microsoft → Seattle, our method produces a vector very close to the relation livesIn.



[Figure 1 (image not reproduced). Edge labels shown, in three groups along the path: (1) headquartered in, headquarters located in, founded in, based in; (2) in the U.S. state of, located in the state of, beautiful city in, in state; (3) state part of, state in the NW region of, located in country, democratic state in.]

Figure 1: Semantically similar paths connecting entity pair (Microsoft, USA).


Our compositional approach allows us at test time to make predictions from paths that were unseen during training, because of the generalization provided by vector neighborhoods, and because they are composed in non-atomic fashion. This allows our model to seamlessly perform inference on many millions of paths in the KB graph. In most of our experiments, we learn a separate RNN for predicting each relation type, but alternatively, by learning a single high-capacity composition function for all relation types, our method can perform zero-shot learning: predicting new relation types for which the composition function was never explicitly trained.

Related to our work, new versions of PRA (Gardner et al., 2013; Gardner et al., 2014) use pre-trained vector representations of relations to alleviate its feature explosion problem, but the core mechanism continues to be a classifier based on atomic-path features. In the 2013 work many paths are collapsed by clustering paths according to their relations' embeddings, and substituting cluster ids for the original relation types. In the 2014 work unseen paths are mapped to nearby paths seen at training time, where nearness is measured using the embeddings. Neither is able to perform zero-shot learning since there must be a classifier for each predicted relation type. Furthermore, their pre-trained vectors do not have the opportunity to be tuned to the KB completion task because the two sub-tasks are completely disentangled.

An additional contribution of our work is a new large-scale dataset of over 52 million triples, and its preprocessing for purposes of path-based KB completion (it can be downloaded from http://iesl.cs.umass.edu/downloads/inferencerules/release.tar.gz). The dataset is built from the combination of Freebase (Bollacker et al., 2008) and Google's entity linking in ClueWeb (Orr et al., 2013). Rather than Gardner's 1000 distinct paths per relation type, we have over 2 million. Rather than Gardner's 200 entity pairs, we use over 10k. All experimental comparisons below are performed on this new dataset.

Figure 2: Vector representations of the paths are computed by applying the composition function recursively.

On this challenging large-scale dataset our compositional method outperforms PRA (Lao et al., 2012) and Cluster PRA (Gardner et al., 2013) by 11% and 7% respectively. A further contribution of our work is a new, surprisingly strong baseline method using classifiers of path bigram features, which beats PRA and Cluster PRA, and statistically ties our compositional method. Our analysis shows that our method has substantially different strengths than the new baseline, and the combination of the two yields a 15% improvement over Gardner et al. (2013). We also show that our zero-shot model is indeed capable of predicting new unseen relation types.

2 Background 

We give background on PRA, which we use to obtain a set of paths connecting the entity pairs, and on the RNN model, which we employ to model the composition function.

2.1 Path Ranking Algorithm 

Since it is impractical to exhaustively obtain the set of all paths connecting an entity pair in the large KB graph, we use PRA (Lao et al., 2011) to obtain a set of paths connecting the entity pairs. Given a training set of entity pairs for a relation, PRA heuristically finds a set of paths by performing random walks from the source and target nodes, keeping the most common paths. We use PRA to find millions of distinct paths per relation type. We do not use the random walk probabilities given by PRA since using them did not yield improvements in our experiments.
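The idea of finding paths by random walks and keeping the most common ones can be sketched as follows. This is a simplified illustration and not the PRA implementation: it walks only forward from the source node, and the function name `sample_paths` and its parameters (`num_walks`, `max_len`, `top_k`) are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def sample_paths(triples, source, target, num_walks=1000, max_len=3, top_k=5, seed=0):
    """Return the most common relation sequences that lead from source to target."""
    rng = random.Random(seed)
    out_edges = defaultdict(list)
    for head, rel, tail in triples:
        out_edges[head].append((rel, tail))
    counts = Counter()
    for _ in range(num_walks):
        node, path = source, []
        for _ in range(max_len):
            edges = out_edges.get(node)
            if not edges:
                break  # dead end: abandon this walk
            rel, node = rng.choice(edges)
            path.append(rel)
            if node == target:
                counts[tuple(path)] += 1  # record the relation sequence only
                break
    return [path for path, _ in counts.most_common(top_k)]


# Toy graph (hypothetical data) reusing the running example.
toy_kb = [
    ("Microsoft", "IsBasedIn", "Seattle"),
    ("Seattle", "StateLocatedIn", "Washington"),
    ("Washington", "CountryLocatedIn", "USA"),
]
paths = sample_paths(toy_kb, "Microsoft", "USA")
print(paths)
```

Only the relation sequence is kept, not the intermediate entities, which matches the way paths are treated as sequences of relation types in the rest of the paper.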


2.2 Recurrent Neural Networks 

A recurrent neural network (RNN) (Werbos, 1990) is a neural network that constructs vector representations for sequences (of any length). For example, an RNN model can be used to construct vector representations for phrases or sentences (of any length) in natural language by applying a composition function (Mikolov et al., 2010; Sutskever et al., 2014; Vinyals et al., 2014). The vector representation of a phrase $(w_1, w_2)$ consisting of words $w_1$ and $w_2$ is given by $f(W[v(w_1); v(w_2)])$ where $v(w) \in \mathbb{R}^d$ is the vector representation of $w$, $f$ is an element-wise nonlinearity function, $[a; b]$ represents the concatenation of two vectors $a$ and $b$ along with a bias term, and $W \in \mathbb{R}^{d \times (2d+1)}$ is the composition matrix. This operation can be repeated to construct vector representations of longer phrases.

3 Recurrent Neural Networks for KB Completion

This paper proposes an RNN model for KB completion that reasons on the paths connecting an entity pair to predict missing relation types. The vector representations of the paths (of any length) in the KB graph are computed by applying the composition function recursively as shown in Figure 2. To compute the vector representations for the higher nodes in the tree, the composition function consumes the vector representations of the node's two child nodes and outputs a new vector of the same dimension. Predictions about missing relation types are made by comparing the vector representation of the path with the vector representation of the relation using the sigmoid function.

We represent each binary relation using a d-dimensional real-valued vector. We model composition using recurrent neural networks (Werbos, 1990). We learn a separate composition matrix for every relation that is predicted.

Let $v_r(\delta) \in \mathbb{R}^d$ be the vector representation of relation $\delta$ and $v_p(\pi) \in \mathbb{R}^d$ be the vector representation of path $\pi$. $v_p(\pi)$ denotes the relation vector if path $\pi$ is of length one. To predict relation $\delta$ = CountryOfHeadquarters, the vector representation of the path $\pi$ = IsBasedIn → StateLocatedIn containing two relations IsBasedIn and StateLocatedIn is computed by (Figure 2),

$$v_p(\pi) = f(W_\delta [v_r(\text{IsBasedIn}); v_r(\text{StateLocatedIn})])$$

where $f$ = sigmoid is the element-wise non-linearity function, $W_\delta \in \mathbb{R}^{d \times (2d+1)}$ is the composition matrix for $\delta$ = CountryOfHeadquarters and $[a; b]$ represents the concatenation of two vectors $a \in \mathbb{R}^d$, $b \in \mathbb{R}^d$ along with a bias feature to get a new vector $[a; b] \in \mathbb{R}^{2d+1}$.

The vector representation of the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn in Figure 2 is computed similarly by,

$$v_p(\Pi) = f(W_\delta [v_p(\pi); v_r(\text{CountryLocatedIn})])$$

where $v_p(\pi)$ is the vector representation of path IsBasedIn → StateLocatedIn. While computing the vector representation of a path we always traverse left to right, composing the relation vector on the right with the accumulated path vector on the left (see footnote 1). This makes our model a recurrent neural network (Werbos, 1990).

Finally, we make a prediction regarding CountryOfHeadquarters(Microsoft, USA) using the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn by comparing the vector representation of the path ($v_p(\Pi)$) with the vector representation of the relation CountryOfHeadquarters ($v_r(\text{CountryOfHeadquarters})$) using the sigmoid function.
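The left-to-right composition and the final comparison can be sketched in NumPy as below. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names, the random initialization, and the dimension `d = 5` are illustrative, and the comparison is taken to be the sigmoid of a dot product, as in the probability defined in Section 3.1.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def compose_path(W, relation_vectors):
    """Left-to-right composition: v_p = f(W [v_p; v_r; 1]) for each further relation."""
    path_vec = relation_vectors[0]  # a length-one path is just its relation vector
    for rel_vec in relation_vectors[1:]:
        stacked = np.concatenate([path_vec, rel_vec, [1.0]])  # [a; b] plus bias feature
        path_vec = sigmoid(W @ stacked)
    return path_vec


def score(W, relation_vectors, target_vec):
    """Sigmoid of the dot product between the path vector and the target relation vector."""
    return float(sigmoid(compose_path(W, relation_vectors) @ target_vec))


# Illustrative random parameters (assumed values, not trained ones).
rng = np.random.default_rng(0)
d = 5
W = rng.normal(scale=0.1, size=(d, 2 * d + 1))      # composition matrix, d x (2d+1)
is_based_in, state_located_in, country_located_in = rng.normal(size=(3, d))
country_of_headquarters = rng.normal(size=d)

p = score(W, [is_based_in, state_located_in, country_located_in], country_of_headquarters)
print(p)
```

Because the same matrix is applied at every step, a path of any length maps to a vector of dimension d, which is what allows paths never seen in training to be scored.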

3.1 Model Training 

We train the model using the existing facts in a KB as positive examples, and obtain negative examples by treating unobserved instances as negative (Mintz et al., 2009; Lao et al., 2011; Riedel et al., 2013; Bordes et al., 2013). Unlike in previous work that uses RNNs (Socher et al., 2011; Iyyer et al., 2014; Irsoy and Cardie, 2014), a challenge with using them for our task is that among the set of paths connecting an entity pair, we do not observe which of the path(s) is predictive of a relation. We select the path that is closest to the relation type to be predicted in the vector space. This not only allows for faster training (compared to marginalization) but also gives improved performance. This technique has been successfully used in models other than RNNs previously (Weston et al., 2013; Neelakantan et al., 2014).

1 We did not get significant improvements when we tried more sophisticated ordering schemes for computing the path representations.


Algorithm 1 Training Algorithm of RNN model for relation $\delta$

 1: Input: $\Lambda_\delta = \Lambda_\delta^+ \cup \Lambda_\delta^-$, number of iterations $T$, mini-batch size $B$
 2: Initialize $v_r$, $W_\delta$ randomly
 3: for $t = 1, 2, \ldots, T$ do
 4:     $\nabla v_r = 0$, $\nabla W_\delta = 0$ and $b = 0$
 5:     for $\lambda = (\gamma, \delta) \in \Lambda_\delta$ do
 6:         $\mu_\lambda = \arg\max_{\pi \in \Phi_\delta(\gamma)} v_p(\pi) \cdot v_r(\delta)$
 7:         Accumulate gradients to $\nabla v_r$, $\nabla W_\delta$
 8:             using path $\mu_\lambda$.
 9:         $b = b + 1$
10:         if $b = B$ then
11:             Gradient Update for $v_r$, $W_\delta$
12:             $\nabla v_r = 0$, $\nabla W_\delta = 0$ and $b = 0$
13:         end if
14:     end for
15:     if $b > 0$ then
16:         Gradient Update for $v_r$, $W_\delta$
17:     end if
18: end for
19: Output: $v_r$, $W_\delta$


We assume that we are given a KB (for example, Freebase enriched with SVO triples) containing a set of entity pairs $\Gamma$, a set of relations $\Delta$ and a set of observed facts $\Lambda^+$ where $\forall \lambda = (\gamma, \delta) \in \Lambda^+$ ($\gamma \in \Gamma$, $\delta \in \Delta$) indicates a positive fact that entity pair $\gamma$ is in relation $\delta$. Let $\Phi_\delta(\gamma)$ denote the set of paths connecting entity pair $\gamma$ given by PRA for predicting relation $\delta$.

In our task, we only observe the set of paths connecting an entity pair but we do not observe which of the path(s) is predictive of the fact. We treat this as a latent variable ($\mu_\lambda$ for the fact $\lambda$) and we assign $\mu_\lambda$ the path whose vector representation has maximum dot product with the vector representation of the relation to be predicted. For example, $\mu_\lambda$ for the fact $\lambda = (\gamma, \delta) \in \Lambda^+$ is given by,

$$\mu_\lambda = \arg\max_{\pi \in \Phi_\delta(\gamma)} v_p(\pi) \cdot v_r(\delta)$$

During training, we assign $\mu_\lambda$ using the current parameter estimates. We use the same procedure to assign $\mu_\lambda$ for unobserved facts that are used as negative examples during training.
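The assignment of the latent path can be illustrated with a short NumPy sketch. The function name `select_path` and the two-dimensional toy vectors are assumptions chosen so the result can be checked by hand; in the model the path vectors would come from the composition function under the current parameters.

```python
import numpy as np


def select_path(path_vectors, relation_vector):
    """Return the index and score of the path with maximum dot product to the relation."""
    scores = path_vectors @ relation_vector   # one dot product per candidate path
    best = int(np.argmax(scores))
    return best, float(scores[best])


# Three candidate path vectors for one entity pair (hypothetical values).
candidate_paths = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.5, 0.5],
])
target_relation = np.array([1.0, 0.0])

best_index, best_score = select_path(candidate_paths, target_relation)
print(best_index, best_score)
```

Only the selected path contributes gradients for that fact, so the cost per training instance is one argmax over the candidate paths rather than a sum over all of them.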

We train a separate RNN model for predicting each relation and the parameters of the model for predicting relation $\delta \in \Delta$ are $\Theta = \{v_r(\omega)\ \forall \omega \in \Delta, W_\delta\}$. Given a training set consisting of positive ($\Lambda_\delta^+$) and negative ($\Lambda_\delta^-$) instances (see footnote 2) for relation $\delta$, the parameters are trained to maximize the log likelihood of the training set with L2 regularization.

$$\Theta^* = \arg\max_\Theta \sum_{\lambda = (\gamma, \delta) \in \Lambda_\delta^+} P(y_\lambda = 1; \Theta) + \sum_{\lambda = (\gamma, \delta) \in \Lambda_\delta^-} P(y_\lambda = 0; \Theta) - \rho \|\Theta\|^2$$

where $y_\lambda$ is a binary random variable which takes the value 1 if the fact $\lambda$ is true and 0 otherwise, and the probability of a fact $P(y_\lambda = 1; \Theta)$ is given by,

$$P(y_\lambda = 1; \Theta) = \text{sigmoid}(v_p(\mu_\lambda) \cdot v_r(\delta)), \quad \text{where } \mu_\lambda = \arg\max_{\pi \in \Phi_\delta(\gamma)} v_p(\pi) \cdot v_r(\delta)$$

and $P(y_\lambda = 0; \Theta) = 1 - P(y_\lambda = 1; \Theta)$. The relation vectors and the composition matrix are initialized randomly. We train the network using backpropagation through structure (Goller and Kuchler, 1996).

4 Zero-shot KB Completion 

The KB completion task involves predicting facts on thousands of relation types, and it is highly desirable that a method can infer facts about relation types without directly training for them. Given the vector representation of the relations, we show that our model described in the previous section is capable of predicting relational facts without explicitly training for the target (or test) relation types (zero-shot learning).

In zero-shot or zero-data learning (Larochelle et al., 2008; Palatucci et al., 2009), some labels or classes are not available while training the model and only a description of those classes is given at prediction time. We make two modifications to the model described in the previous section, (1) learn a general composition matrix, and (2) fix relation vectors with pre-trained vectors, so that we can predict relations that are unseen during training. This ability of the model to generalize to unseen relations is beyond the capabilities of all previous methods for KB inference (Schoenmackers et al., 2010; Lao et al., 2011; Gardner et al., 2013; Gardner et al., 2014).

We learn a general composition matrix for all relations instead of learning a separate composition matrix for every relation to be predicted. So, for example, the vector representation of the path $\pi$ = IsBasedIn → StateLocatedIn containing two relations IsBasedIn and StateLocatedIn is computed by (Figure 2),

$$v_p(\pi) = f(W [v_r(\text{IsBasedIn}); v_r(\text{StateLocatedIn})])$$

where $W \in \mathbb{R}^{d \times (2d+1)}$ is the general composition matrix.

2 We sub-sample a portion of the set of all unobserved instances.

We initialize the vector representations of the binary relations ($v_r$) using the representations learned in Riedel et al. (2013) and do not update them during training. The relation vectors are not updated because at prediction time we would be predicting relation types which are never seen during training, and hence their vectors would never get updated. We learn only the general composition matrix in this model. We train a single model for a set of relation types by replacing the sigmoid function with a softmax function while computing probabilities, and the parameters of the composition matrix are learned using the available training data containing instances of a few relations. The other aspects of the model remain unchanged.

To predict facts whose relation types are unseen during training, we compute the vector representation of the path using the general composition matrix and compute the probability of the fact using the pre-trained relation vector. For example, using the vector representation of the path $\Pi$ = IsBasedIn → StateLocatedIn → CountryLocatedIn in Figure 2, we can predict any relation, irrespective of whether it was seen during training, by comparing it with the pre-trained relation vectors.
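A minimal sketch of the zero-shot prediction step is given below. It assumes that the probabilities are a softmax over the dot products between the composed path vector and a set of candidate pre-trained relation vectors; the exact normalization set is not spelled out in the text, so that choice, the function names, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    shifted = logits - np.max(logits)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def relation_probabilities(path_vec, relation_matrix):
    """Distribution over candidate relations; row i of relation_matrix is one relation vector."""
    return softmax(relation_matrix @ path_vec)


# Composed path vector and fixed, pre-trained relation vectors (hypothetical values).
path_vec = np.array([1.0, 0.0])
pretrained_relations = np.array([
    [2.0, 0.0],    # could be a relation type never seen while training W
    [0.0, 2.0],
    [-1.0, 0.0],
])

probs = relation_probabilities(path_vec, pretrained_relations)
print(probs, int(np.argmax(probs)))
```

Since the relation vectors are fixed, adding a new relation type at prediction time only requires adding a row to the candidate matrix; the composition matrix is left untouched.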

5 Experiments 

The hyperparameters of all the models were tuned on the same held-out development data. All the neural network models are trained for 150 iterations using 50-dimensional relation vectors, and we set the L2 regularizer and learning rate to 0.0001 and 0.1 respectively. We halve the learning rate after every 60 iterations and use mini-batches of size 20. The neural networks and the classifiers were optimized using AdaGrad (Duchi et al., 2011).
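The optimizer settings above can be sketched as follows. This is a generic AdaGrad step written in minimization form with the reported hyperparameters as defaults; the function names, the `eps` constant, the zero-based iteration index, and the way the L2 term is folded into the gradient are assumptions, not details given in the paper.

```python
import numpy as np


def learning_rate(iteration, base_lr=0.1, halve_every=60):
    """Step size for a zero-based iteration index, halved after every 60 iterations."""
    return base_lr * 0.5 ** (iteration // halve_every)


def adagrad_step(param, grad, hist, lr, l2=1e-4, eps=1e-8):
    """One AdaGrad update on a loss with an added l2 * ||param||^2 penalty."""
    total_grad = grad + 2.0 * l2 * param        # gradient of loss plus L2 penalty
    hist = hist + total_grad ** 2               # running sum of squared gradients
    param = param - lr * total_grad / (np.sqrt(hist) + eps)
    return param, hist


# Schedule over the 150 training iterations.
schedule = [learning_rate(t) for t in range(150)]

# A single update on a toy one-dimensional parameter, with the L2 term switched off.
new_param, new_hist = adagrad_step(
    np.array([1.0]), np.array([0.5]), np.zeros(1), lr=schedule[0], l2=0.0
)
print(schedule[0], schedule[60], schedule[120], new_param)
```

With 150 iterations the schedule takes three values, 0.1, 0.05 and 0.025, and the per-coordinate scaling of AdaGrad further shrinks the step for parameters that have already received large gradients.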

5.1 Data 

We ran experiments on Freebase (Bollacker et al., 2008) enriched with information from ClueWeb.

Entities                                 18M
Freebase triples                         40M
ClueWeb triples                          12M
Relations                                25,994
Relation types tested                    46
Avg. paths/relation                      2.3M
Avg. training facts/relation             6638
Avg. positive test instances/relation    3492
Avg. negative test instances/relation    43,160

Table 1: Statistics of our dataset.


We use the publicly available entity links to Freebase in the ClueWeb dataset (Orr et al., 2013). Hence, we create nodes only for Freebase entities in our KB graph. We remove facts containing /type/object/type as they do not give useful predictive information for our task. We get triples from ClueWeb by considering sentences that contain two entities linked to Freebase. We extract the phrase between the two entities and treat it as the relation type. For phrases that are longer than four words we keep only the first and last two words. This helps us to avoid the time-consuming step of dependency parsing the sentence to get the relation type. These triples are similar to facts obtained by OpenIE (Banko et al., 2007). To reduce noise, we select relation types that occur at least 50 times. We evaluate on the 46 relation types in Freebase that have the largest number of instances. The methods are evaluated on a subset of facts in Freebase that were hidden during training. Table 1 shows important statistics of our dataset.
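The phrase truncation and frequency filter described above can be sketched as follows. The function names are hypothetical, whitespace tokenization is an assumption, and joining the kept words with commas simply mirrors the way textual relations are printed in Table 2.

```python
from collections import Counter


def textual_relation(phrase, max_len=4, keep=2):
    """Turn the phrase between two entity mentions into a relation type string."""
    words = phrase.split()
    if len(words) > max_len:
        # Longer than four words: keep only the first two and the last two.
        words = words[:keep] + words[-keep:]
    return ",".join(words)


def frequent_relations(relation_types, min_count=50):
    """Keep the relation types that occur at least `min_count` times."""
    counts = Counter(relation_types)
    return {rel for rel, n in counts.items() if n >= min_count}


short = textual_relation("was born in")
long = textual_relation("is a small but beautiful city in")
kept = frequent_relations(["was,born,in"] * 50 + ["defended"] * 3)
print(short, long, kept)
```

The truncation keeps the relation vocabulary manageable without any parsing, at the cost of occasionally merging phrases that differ only in their middle words.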

5.2 Predictive Paths 

Table 2 shows predictive paths for 4 relations learned by the RNN model. The high quality of unseen paths is indicative of the fact that the RNN model is able to generalize to paths that are never seen during training.

5.3 Results 

Using our dataset, we compare the performance of 
the following methods: 

PRA Classifier is the method in Lao et al. (2012) which trains a logistic regression classifier by creating a feature for every path type.

Cluster PRA Classifier is the method in Gardner et al. (2013) which replaces relation types from ClueWeb triples with their cluster membership in the KB graph before the path finding step. After this step, their method proceeds in exactly the same manner as Lao et al. (2012), training a logistic regression classifier by creating a feature for every path type. We use pre-trained relation vectors from Riedel et al. (2013) and use k-means clustering to cluster the relation types into 25 clusters as done in Gardner et al. (2013).

Composition-Add uses a simple element-wise addition followed by sigmoid non-linearity as the composition function, similar to Yang et al. (2014).

RNN-random is the supervised RNN model described in section 3 with the relation vectors initialized randomly.

RNN is the supervised RNN model described in section 3 with the relation vectors initialized using the method in Riedel et al. (2013).

PRA Classifier-b is our simple extension to the method in Lao et al. (2012) which additionally uses bigrams in the path as features. We add a special start and stop symbol to the path before computing the bigram features.

Cluster PRA Classifier-b is our simple extension to the method in Gardner et al. (2013) which additionally uses bigram features computed as previously described.

RNN + PRA Classifier combines the predictions of RNN and PRA Classifier. We combine the predictions by assigning the score of a fact as the sum of its ranks in the two models after sorting the facts in ascending order.

RNN + PRA Classifier-b combines the predictions of RNN and PRA Classifier-b using the technique described previously.
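Two of the steps in the list above, the path bigram features and the rank-based combination, are simple enough to sketch directly. The symbol strings `#START#` and `#STOP#`, the function names, and the tie-breaking by original order are assumptions made for illustration; the paper does not specify them.

```python
def bigram_features(path, start="#START#", stop="#STOP#"):
    """Bigrams of relation types in a path, with start and stop symbols added."""
    tokens = [start] + list(path) + [stop]
    return list(zip(tokens, tokens[1:]))


def ranks_ascending(scores):
    """Rank of each fact after sorting by score in ascending order (0 = lowest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return ranks


def combine_by_rank(scores_a, scores_b):
    """Combined score of each fact: the sum of its ranks under the two models."""
    return [ra + rb for ra, rb in zip(ranks_ascending(scores_a), ranks_ascending(scores_b))]


features = bigram_features(["IsBasedIn", "StateLocatedIn"])

# Scores from two models for the same three facts (hypothetical values).
rnn_scores = [0.9, 0.2, 0.5]
classifier_scores = [3.0, 1.0, 2.0]
combined = combine_by_rank(rnn_scores, classifier_scores)
print(features, combined)
```

Summing ranks rather than raw scores avoids having to calibrate the RNN probabilities against the classifier scores, since only the ordering within each model matters.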

Table 3 shows the results of our experiments. The method described in Gardner et al. (2014) is not included in the table since the publicly available implementation does not scale to our large dataset. First, we show that it is better to train the models using all the path types instead of using only the top 1,000 path types as done in previous work (Gardner et al., 2013; Gardner et al., 2014). We can see that the RNN model performs significantly better than the baseline methods of Lao et al. (2012) and Gardner et al. (2013). The performance of the RNN model is not affected by initialization since using random vectors and pre-trained vectors results in similar performance.

A surprising result is the impressive performance of our simple extension to the classifier approach. After the addition of bigram features, the naive PRA method is as effective as the Cluster PRA method. The small difference in performance between RNN and both PRA Classifier-b and Cluster PRA Classifier-b is not statistically significant. We conjecture that our method has substantially different strengths than the new baseline. While the classifier with bigram features has an ability to accurately memorize important local structure, the RNN model generalizes better to unseen paths that are very different from the paths seen in training. Empirically, combining the predictions of RNN and PRA Classifier-b achieves a statistically significant gain over PRA Classifier-b.

Relation: /book/written_work/original_language/ (book "x" written in language "y")
  Seen paths:
    /book/written_work/previous_in_series → /book/written_work/author → /people/person/nationality → /people/person/nationality^-1 → /people/person/languages
    /book/written_work/author → /people/ethnicity/people^-1 → /people/ethnicity/languages_spoken
  Unseen paths:
    "in"^-1 → "writer"^-1 → /people/person/nationality^-1 → /people/person/languages
    /book/written_work/author → "addresses" → /people/person/nationality^-1 → /people/person/languages

Relation: /people/person/place_of_birth/ (person "x" born in place "y")
  Seen paths:
    "was,born,in" → /location/mailing_address/citytown^-1 → /location/mailing_address/state_province_region
    "from" → /location/location/contains^-1
  Unseen paths:
    "born,in" → /location/location/contains → "near"^-1
    "was,born,in" → "commonly,known,as"^-1

Relation: /geography/river/cities/ (river "x" flows through or borders "y")
  Seen paths:
    "at" → /location/location/contains^-1
    "meets,the" → /transportation/bridge/body_of_water_spanned^-1 → /location/location/contains^-1 → "in"
  Unseen paths:
    /geography/lake/outflow^-1 → /location/location/contains^-1
    /geography/lake/outflow^-1 → /location/location/contains^-1 → "near"

Relation: /people/family/members/ (person "y" part of family "x")
  Seen paths:
    /royalty/monarch/royal_line^-1 → /people/person/children → /royalty/monarch/royal_line → /royalty/royal_line/monarchs_from_this_line
    /royalty/royal_line/monarchs_from_this_line → /people/person/parents^-1 → /people/person/parents^-1 → /people/person/parents^-1
  Unseen paths:
    /royalty/monarch/royal_line^-1 → "leader"^-1 → "king" → "was,married,to"^-1
    "of,the"^-1 → "but,also,of" → "married" → "defended"^-1

Table 2: Predictive paths, according to the RNN model, for 4 target relations. Two examples of seen and unseen paths are shown for each target relation. Inverse relations are marked by ^-1, i.e., $r(x, y) \Rightarrow r^{-1}(y, x), \forall (x, y) \in r$. Relations within quotes are OpenIE (textual) relation types.

Method                       MAP (train with top 1000 paths)   MAP (train with all paths)
PRA Classifier               43.46                             51.31
Cluster PRA Classifier       46.26                             53.23
Composition-Add              40.23                             45.37
RNN-random                   45.52                             56.91
RNN                          46.61                             56.95
PRA Classifier-b             48.09                             58.13
Cluster PRA Classifier-b     48.72                             58.02
RNN + PRA Classifier         49.92                             58.42
RNN + PRA Classifier-b       51.94                             61.17

Table 3: Results comparing different methods on 46 types. All the methods perform better when trained using all the paths than when trained using the top 1,000 paths. When training with all the paths, RNN performs significantly ($p < 0.005$) better than PRA Classifier and Cluster PRA Classifier. The small difference in performance between RNN and both PRA Classifier-b and Cluster PRA Classifier-b is not statistically significant. The best results are obtained by combining the predictions of RNN with PRA Classifier-b, which performs significantly ($p < 10^{-5}$) better than both PRA Classifier-b and Cluster PRA Classifier-b.

Method       MAP (train with top 1000 paths)   MAP (train with all paths)
RNN          43.82                             50.10
zero-shot    19.28                             20.61
Random       7.59

Table 4: Results comparing the zero-shot model with supervised RNN and a random baseline on 10 types. RNN is the fully supervised model described in section 3 while zero-shot is the model described in section 4. The zero-shot model, without explicitly training for the target relation types, achieves impressive results by performing significantly ($p < 0.05$) better than a random baseline.
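The MAP scores reported in Tables 3 and 4 can be read with the following sketch in mind. It assumes that MAP denotes mean average precision, computed per relation type over the ranked test facts and then averaged across relation types; the paper does not spell out the computation, so the function names and this averaging scheme are assumptions.

```python
def average_precision(ranked_labels):
    """Average precision of one ranked list; labels are 1 (true fact) or 0, best-scored first."""
    hits, total = 0, 0.0
    for position, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            total += hits / position   # precision at this positive's rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings_per_relation):
    """Mean of the per-relation average precisions."""
    values = [average_precision(labels) for labels in rankings_per_relation]
    return sum(values) / len(values)


# Two relation types with hand-checkable rankings (hypothetical data).
ap_first = average_precision([1, 0, 1])     # (1/1 + 2/3) / 2
ap_second = average_precision([0, 1])       # (1/2) / 1
overall = mean_average_precision([[1, 0, 1], [0, 1]])
print(ap_first, ap_second, overall)
```

Under this reading, a method is rewarded for placing held-out true facts above the sampled negative instances for each relation, independent of the absolute scale of its scores.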

5.3.1 Zero-shot 

Table 4 shows the results of the zero-shot model described in section 4 compared with the fully supervised RNN model (section 3) and a baseline that produces a random ordering of the test facts. We evaluate on 10 randomly selected relation types (out of 46), hence for the fully supervised version we train 10 RNNs, one for each relation type. For evaluating the zero-shot model, we randomly split the relations into two sets of equal size and train a zero-shot model on one set and test on the other set. So, in this case we have two RNNs making predictions on relation types that they have never seen during training. As expected, the fully supervised RNN outperforms the zero-shot model by a large margin, but the zero-shot model, without using any direct supervision, clearly performs much better than a random baseline.

5.3.2 Discussion 

To investigate whether the performance of the RNNs was affected by multiple local optima, we combined the predictions of five different RNNs trained using all the paths. Apart from RNN and RNN-random, we trained three more RNNs with different random initializations; the individual performance of these three RNNs is 57.09, 57.11 and 56.91. The performance of the ensemble is 59.16, and it stopped improving after using three RNNs. This indicates that even though multiple local optima affect the performance, they are likely not the only issue, since the performance of the ensemble is still lower than that of RNN + PRA Classifier-b.

We suspect the RNN model does not capture some of the important local structure as well as the classifier using bigram features. To overcome this drawback, in future work, we plan to explore compositional models that have a longer memory (Hochreiter and Schmidhuber, 1997; Cho et al., 2014; Mikolov et al., 2014). We also plan to include vector representations for the entities and develop models that address the issue of polysemy in verb phrases (Cheng et al., 2014).

6 Related Work 

KB Completion includes methods such as Lin and Pantel (2001), Yates and Etzioni (2007) and Berant et al. (2011) that learn inference rules of length one. Schoenmackers et al. (2010) learn general inference rules by considering the set of all paths in the KB and selecting paths that satisfy a certain precision threshold. Their method does not scale well to modern KBs and also depends on carefully tuned thresholds. Lao et al. (2011) train a simple logistic regression classifier with NELL KB paths as features to perform KB completion, while Gardner et al. (2013) and Gardner et al. (2014) extend it by using pre-trained relation vectors to overcome feature sparsity. Recently, Yang et al. (2014) learn inference rules using simple element-wise addition or multiplication as the composition function.

Compositional Vector Space Models have been developed to represent phrases and sentences in natural language as vectors (Mitchell and Lapata, 2008; Baroni and Zamparelli, 2010; Yessenalina and Cardie, 2011). Neural networks have been successfully used to learn vector representations of phrases using the vector representations of the words in that phrase. Recurrent neural networks have been used for many tasks such as language modeling (Mikolov et al., 2010), machine translation (Sutskever et al., 2014) and parsing (Vinyals et al., 2014). Recursive neural networks, a more general version of the recurrent neural networks, have been used for many tasks like parsing (Socher et al., 2011), sentiment classification (Socher et al., 2012; Socher et al., 2013c; Irsoy and Cardie, 2014), question answering (Iyyer et al., 2014) and natural language logical semantics (Bowman et al., 2014). Our overall approach is similar to RNNs with attention (Bahdanau et al., 2014; Graves, 2013) since we select a path among the set of paths connecting the entity pair to make the final prediction.

Zero-shot or zero-data learning was introduced in Larochelle et al. (2008) for character recognition and drug discovery. Palatucci et al. (2009) perform zero-shot learning for neural decoding, while there has been plenty of work in this direction for image recognition (Socher et al., 2013b; Frome et al., 2013; Norouzi et al., 2014).

7 Conclusion 

We develop a compositional vector space model for knowledge base completion using recurrent neural networks. On our challenging large-scale dataset, available at http://iesl.cs.umass.edu/downloads/inferencerules/release.tar.gz, our method outperforms two baseline methods and performs competitively with a modified stronger baseline. The best results are obtained by combining the predictions of our model with the predictions of the modified baseline, which achieves a 15% improvement over Gardner et al. (2013). We also show that our model has the ability to perform zero-shot inference.